ABSTRACT

Roles suggested for the National Assessment of
Educational Progress (NAEP) in proposed national testing are
discussed. Recent proposals center around monitoring, evaluation and
accountability, and serving as a benchmark for other tests.
Monitoring has been the traditional function of the NAEP, and in this
role, its influence has been substantial. The movement to use this
assessment for evaluation and accountability is well under way, as
exemplified in the Trial State Assessment of 1990 and its extensions.
Proposals to use the NAEP as a benchmark would, in effect, use it as a
substitute for publishers' norming samples or as a national anchor test.
These proposals raise the following technical issues: (1) linking a broad
matrix-sampled test to narrower student-level tests; (2) linking
taught-to tests to uncorrupted tests; (3) keeping the NAEP itself
uncorrupted; and (4) erroneous evaluations based on insufficient
data. It is concluded that the NAEP remains best suited for its
original monitoring function. There is a 13-item list of references.
(SLD) 
Current proposals for national testing, while in some respects
pathbreaking, are in other ways a continuation of the reforms of the 1980s. Those
reforms left control at the state level, but they were nonetheless national in
that a few principal elements (in particular, increases in externally
mandated testing, heightened consequences for test scores, and stiffened
requirements for graduation from high school) were common to the
initiatives in many states. The current debate echoes the 1980s themes of
stiffened standards and greater reliance on externally mandated testing, but
the focus of the debate has become more clearly national. Even though
formal decision-making power still rests with the states, policies are now
being formulated at the national level: for example, by the Administration,
the National Education Goals Panel, the National Council on Education
Standards and Testing, and the Congress.

Central to many of the proposed new reforms is national testing. Some 
proposals call simply for one or more national tests, while others call for a 
national system of independent, "voluntary" exams (voluntary for whom, one 
should ask) that will be linked in some manner via national standards. 

When Michael Kean first suggested this symposium, the 
Administration's proposal called for using the National Assessment of 
Educational Progress (NAEP) as an interim national examination until a 
new set of "American Achievement Tests" could be readied. This aspect of 
the Administration's proposal seems to have faded into the background, and
none of the other leading proposals for a national examination system would
use NAEP in that manner. Nonetheless, NAEP remains a central part of
major proposals, and the roles that are proposed for it highlight some of the
major technical and political themes of the current debate.

"WE" VERSUS "THEY"

Before turning to the NAEP, I would like to digress for a moment
about the title of this symposium, Educational Assessment: Are the
Politicians Winning?

This title might suggest "we" versus "they": politicians trying to fiddle
inappropriately with assessment, while we technically knowledgeable good
guys hold the line, or try to, against unrealistic expectations or even outright
misuse of tests. Unfortunately, in my experience, the teams have not lined
up so tidily. Certainly, some politicians have called for questionable uses of
tests and have paid too little heed to expert advice and historical evidence.
But so have some social scientists. The opposing side (those who have
voiced objections to the proposed national testing schemes or urged greater
attention to technical concerns or possible undesirable side effects) also has
mixed membership. In my view, many people in the research and testing
communities were slow to address publicly the difficult issues raised by the
proposed national testing systems and hesitant to criticize them. Indeed, to
the extent that matters have slowed down enough for more reasoned
consideration of the proposals (and I am not confident that we are yet
assured of much of a breather), some of the people who deserve the greatest
credit are politicians, most notably Representatives Bill Goodling and Dale
Kildee and their staffs. One of the most striking moments in the debates of
the National Council on Education Standards and Testing occurred when the
Congressional members of the Council distributed a letter insisting that the
Council's call for national testing be held in abeyance until issues of validity,
reliability, and fairness were adequately addressed.

THREE ROLES FOR NAEP

Turning back to the NAEP itself: the functions of NAEP have already
changed substantially from its early days, and current proposals would
transform them further. To sort out the recent proposals, it is helpful to
distinguish three basic roles: (1) monitoring; (2) evaluation and
accountability; and (3) serving as a benchmark for other tests.

Monitoring aggregate trends in performance

This is of course the traditional role of NAEP and was until recently its 
only function: describing what American youth know and can do, and how 
those proficiencies change over time. Traditionally, NAEP was designed to 
monitor achievement in the nation as a whole, and its reporting was limited 
to the national populations of students or youth at specified ages and a very 
small number of subgroups, such as regions and racial/ethnic groups. 
Reporting at other political levels, such as the state level, was avoided by
design. 

How valuable is NAEP as a monitoring tool? Some participants in the 
current debate disparage the usefulness of merely monitoring achievement, 
but NAEP's influence as a monitoring tool has been substantial indeed. The 
current reform movement and that of the past decade arose because of 
widespread dissatisfaction with the performance of American students. That 
dissatisfaction rests in large part on a very few indicators of achievement:
the NAEP, the SAT (however inappropriately), and a small number of
international studies. Similarly, where do the proponents of current reforms
obtain information that the reforms of the 1980s were insufficient and need
to be replaced with new initiatives, including national testing? Again, the
NAEP. While many state and local tests have shown sizable gains in recent
years, the NAEP, the integrity of which was protected because no one taught
to it, has shown only very slight improvements (e.g., Linn and Dunbar, 1990).

The importance of maintaining NAEP or some other vehicle for
monitoring aggregate performance gradually has become more widely 
acknowledged over the past year. For example, the final report of the 
National Council on Education Standards and Testing calls for maintaining 
a separate system of tests such as NAEP for monitoring national trends. 
Nonetheless, other uses are still proposed for NAEP. 

Evaluation and accountability 

The movement to use NAEP not just for monitoring, but also for 
evaluation and accountability, is already well under way.

The principal step in this direction so far was the initiation of state 
comparisons using NAEP ("state NAEP" for short). This began with the Trial 
State Assessment (TSA) of grade-8 mathematics in 1990 and is scheduled by 
law for gradual expansion. The motivations for state NAEP were 
undoubtedly diverse, but the desire to judge the quality of states' educational
programs and to hold policymakers or educators accountable for differences
in performance was prominent among them. For example, the
Alexander-James report, which was perhaps the most influential call for state NAEP,
argued:

Today state and local administrators are encountering rising 
public demand for thorough information on the quality of their
schools, allowing comparison with data from other states and
districts and with their own historical records. Responding to
calls for greater accountability, state officials have increasingly 
turned to the national assessment for assistance (Alexander and 
James, 1987; emphasis added). 

The interpretations given to the first TSA results when they were 
released last June were varied, but many observers used them to evaluate 
the quality of school systems. Even though observers suggested a variety of 
non-educational causes as well, they offered a potpourri of putative 
educational causes of between-state differences. A report on interpretations 
by policymakers and leading media noted the following explanations of 
between-state differences: ability grouping; the shortage of minority
teachers; poorly focused curricula; "rigid, centrally controlled schools;"
television; parental involvement; the proportion of two-parent families; the
proportion of students in large cities; and the proportion of students in
poverty (Jaeger, 1992). One state department of education official
immediately announced an intent to scrap the state's general math 
curriculum. 

Several of the most important recent proposals are straightforward 
extensions of the precedent set by state NAEP. For example, the National 
Assessment Governing Board (NAGB) recently proposed removing all 
restrictions on state NAEP and rescinding the statutory ban on reporting of 
local districts' performance on NAEP. The Administration's now dormant
request to use NAEP as an interim national achievement test would have
extended the trend yet further, to the level of individual students. 

NAEP as Rosetta stone 

A third major proposed use of NAEP is to be the national benchmark
that would clarify the meaning of scores on other tests. These proposals
would in effect use the NAEP as a substitute for publishers' norming samples
or as a national anchor test.



One proposal that gained considerable prominence recently would
have created "NAEP-like" tests that could be used to provide scores on
individual students, while the real NAEP would be reserved for use at the
aggregate level. Immediately before being appointed to his current position
as Secretary of Education, for example, Lamar Alexander proposed to
establish a center at the University of Tennessee, of which he was then
President, that would produce NAEP-like tests that states and districts could
use to score individual students. Later, the National Center for Education
Statistics began exploring the potential for federal funding of such an effort.
The phrase "NAEP-like" could have a variety of disparate meanings, but the
intent was that the tests should be similar to NAEP in ways that would
facilitate the interpretation of individuals' scores in terms of national NAEP
results. In effect, NAEP would have served as a substitute for the norms
offered with current commercial norm-referenced tests. Both of these specific
initiatives seem to be at least dormant.



Other proposals would use NAEP to benchmark states' results on their
own assessments. The Kentucky legislature, for example, recently
established a legal requirement that the state's assessments be linked to
NAEP. Initially, this will be accomplished by an equating study using TSA
results and special supplementary NAEP samples. Other states are
considering linkage as well, and NAGB is considering possible changes in
policy to facilitate, and perhaps evaluate, linkages of this sort.

A more unusual proposal would use NAEP to set international norms 
for performance. ETS has recently proposed that data from the second
International Assessment of Educational Progress be used to estimate the
proportion of students in foreign countries who score above each of the
a priori achievement levels (basic, proficient, and advanced) that NAGB has
recently established for the reporting of NAEP results. In theory, that would
permit states to compare their performance, not only with the national norms
currently provided by national NAEP, but also with norms for a large
number of diverse countries.

A FEW TECHNICAL ISSUES 

Although these proposed new uses for NAEP may seem relatively 
straightforward, they actually raise a number of difficult technical issues. I
will note four today. None of the four has received sufficient attention in
the policy debate about national testing.

Linking a broad, matrix-sampled test to narrower, student-level 
tests 

One of the strengths of NAEP is its breadth of coverage. One factor
that makes this possible is its "BIB-spiraled" design (for simplicity, a variant
of matrix sampling). Each student is administered only a fraction of the total
items in a domain such as mathematics, but the questions administered to
each student nonetheless cover a variety of subdomains, such as
measurement and algebra and functions. In addition, in recent assessments,
each student has been tested in only one subject area. The total assessment
therefore can be broad even if individual testing time is kept short and
students are administered time-consuming tasks such as some performance
assessments. Information for each student is limited to a single domain, and
information about that domain, let alone the subdomains, is quite
unreliable. Those disadvantages, however, are relatively unimportant when
the assessment is designed, as NAEP is, to provide information only at the
aggregate level.
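
The trade-off described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a model of NAEP's actual BIB-spiraled design: the pool size, block size, ability distribution, and item difficulties are all hypothetical, and the blocks here do not overlap as they would in a balanced incomplete block design. Each simulated student answers one short block of items; the sketch then compares how well the block scores recover the group average with how well they recover each individual's score on the full pool.

```python
import random
import statistics


def simulate_matrix_sampling(n_students=3000, n_items=60, block_size=6, seed=0):
    """Compare aggregate and individual accuracy under a spiraled block design.

    All parameters are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    # Hypothetical item difficulties, expressed as shifts in P(correct).
    difficulty = [rng.uniform(-0.2, 0.2) for _ in range(n_items)]
    mean_difficulty = statistics.fmean(difficulty)
    # Partition the item pool into non-overlapping blocks.
    blocks = [list(range(start, start + block_size))
              for start in range(0, n_items, block_size)]

    true_scores, observed_scores, items_used = [], [], set()
    for student in range(n_students):
        # Hypothetical ability, clipped so every probability stays in [0.05, 0.95].
        ability = min(max(rng.gauss(0.6, 0.15), 0.25), 0.75)
        # "Spiral" the blocks: consecutive students receive consecutive blocks.
        block = blocks[student % len(blocks)]
        correct = sum(rng.random() < ability - difficulty[item] for item in block)
        items_used.update(block)
        observed_scores.append(correct / block_size)
        # Expected proportion correct had the student taken the whole pool.
        true_scores.append(ability - mean_difficulty)

    aggregate_error = abs(statistics.fmean(observed_scores)
                          - statistics.fmean(true_scores))
    squared_errors = [(obs - true) ** 2
                      for obs, true in zip(observed_scores, true_scores)]
    individual_rmse = statistics.fmean(squared_errors) ** 0.5
    return {
        "items_covered": len(items_used),
        "aggregate_error": aggregate_error,
        "individual_rmse": individual_rmse,
    }


if __name__ == "__main__":
    for name, value in simulate_matrix_sampling().items():
        print(name, value)
```

Under these assumptions the whole item pool is covered and the group mean is recovered closely, while the typical error in any one student's score is many times larger, which is the sense in which student-level information from such a design is unreliable.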

Tests used to obtain scores for individuals, however, must meet a
different set of standards, particularly when stakes are high. Unlike NAEP,
such tests must include enough items in each domain to provide reliable and
valid estimates of each student's performance. If each student is
administered only a subset of the test, or if multiple forms are used for other
reasons, the forms administered to different students must meet high
standards of comparability, a constraint that NAEP does not face. Unlike
NAEP, these tests must provide each student with a test of every subject area
for which scores are desired.

As a consequence of these considerations, tests used to provide scores 
for individuals are likely to require much more time to administer than 
NAEP, but even with more testing time, such tests are likely to be narrower 
than a good matrix-sampled test such as NAEP. Given that, what does it 
mean to say that an individual-level test is "NAEP-like?" In what sense is 
performance on such a test really comparable to NAEP? If two states use
different "NAEP-like" individual tests, each of which is a subset of the 
NAEP's content, to what degree are they likely to measure the same things? 

Linking "taught-to" tests to uncorrupted tests

We know that tests that are used for accountability tend to be taught
to in ways that produce inflated scores (e.g., Cannell, 1987; Koretz, Linn,
Dunbar, and Shepard, 1991; Linn, Graue, and Sanders, 1990). All other
things being equal, narrower tests are likely to be more susceptible than
others to inflation. NAEP, on the other hand, remains uncorrupted, at least
for the moment; it is protected by its breadth, careful test security
procedures, and, until recently, a lack of interest in teaching to it. This may
help explain the fact that NAEP has shown relatively flat trends in
performance in recent years, while many local and state tests have shown
marked if perhaps illusory progress (e.g., Linn and Dunbar, 1990).

How are local and state tests, many of which will be corrupted if they
are used for accountability, to be linked to the as yet uncorrupted NAEP?
For example, if a state test used for accountability is equated to NAEP in the 
first year of its use, how will that equating relationship change over time as 
students are coached on the content of the state test but not on the content of 
the NAEP? To what extent will improvement on the state's test really 
indicate progress on the NAEP? 
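
The concern can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a simple linear (mean and standard deviation) equating fixed in the first year, and every number in it (the scale means and standard deviations, the true annual gain, and the annual inflation from coaching) is hypothetical rather than drawn from any actual state test or from NAEP. It shows how a first-year equating, applied to later state results, would overstate performance on the NAEP scale.

```python
def linear_equate(score, mean_from, sd_from, mean_to, sd_to):
    """Map a score from one scale to another by matching means and SDs."""
    return mean_to + (sd_to / sd_from) * (score - mean_from)


def equating_drift(years, true_gain_sd=0.02, inflation_sd=0.10,
                   state_mean=50.0, state_sd=10.0,
                   naep_mean=260.0, naep_sd=30.0):
    """Return (year, projected NAEP mean, actual NAEP mean, overstatement).

    Hypothetical assumptions: real achievement rises by `true_gain_sd`
    standard deviations per year on both tests, while coaching adds a
    further `inflation_sd` per year on the state test only.
    """
    rows = []
    for year in range(years + 1):
        state_now = state_mean + state_sd * (true_gain_sd + inflation_sd) * year
        naep_now = naep_mean + naep_sd * true_gain_sd * year
        # Apply the equating relationship that was fixed in year 0.
        projected = linear_equate(state_now, state_mean, state_sd,
                                  naep_mean, naep_sd)
        rows.append((year, projected, naep_now, projected - naep_now))
    return rows


if __name__ == "__main__":
    for year, projected, actual, gap in equating_drift(4):
        print(f"year {year}: projected {projected:.1f}, "
              f"actual {actual:.1f}, overstatement {gap:.1f}")
```

In this sketch the overstatement grows in step with the inflation on the state test, so the original equating becomes steadily less trustworthy even though it was exact in the first year.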

Keeping the NAEP itself uncorrupted 

There is growing awareness in the policy community of the value of
keeping NAEP itself uncorrupted, lest we lose our only frequently collected,
nationally representative, and uninflated indicator of the achievement of
American youth. There is no agreement, however, about the precautions that
must be taken. Now that states are ranked (and evaluated) by the media on
the basis of NAEP scores, will the breadth of the NAEP and the test security
procedures suffice to keep the test uncorrupted? What about the NAGB
proposal to permit use of NAEP at the local level? At this point, we have no
firm answers, and consequences of inadequate caution could be large indeed. 

Erroneous "evaluations" based on insufficient data

The issue raised by using NAEP to evaluate programs is simple: 
neither NAEP nor most other large-scale assessment programs provide
meaningful data on school effectiveness. The rankings of states need not
indicate differences in effectiveness and provide no reliable information
about the factors that influence achievement (Koretz, 1991).

One consequence is apparent in the interpretations of state NAEP
noted earlier: people attribute differences in mean performance to whatever
factors they find appealing. This risks substantial harm. Inadequate
programs will often look good and will be emulated; effective ones will
sometimes be scrapped. Teachers facing unrealistic evaluations based on
cross-sectional data will continue to be tempted, as they have been for a
decade now, to find shortcuts, some illegitimate, for raising scores.

The interpretations of state NAEP noted earlier were based on simple, 
unadjusted cross-sectional differences among states. There are many reasons 
why such comparisons cannot tell us anything meaningful about school
effectiveness. Educational factors are confounded with non-educational
differences among states; the data provide no information about student
growth in achievement; and the data do not indicate anything about
students' educational histories (even which states they spent most of their
school years in). Data about educational factors that influence achievement
are sparse and are partially at the wrong level of aggregation; many of the
important factors vary at the level of districts, schools, teachers, or even
specific classes. Data were provided only for one grade and would include
only three grades if the system were fully implemented, but rankings need
not be consistent from grade to grade. 
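
A toy calculation shows how confounding alone can produce, or even reverse, a ranking. The two states, their poverty shares, and their subgroup means below are entirely hypothetical; they are not NAEP results. The sketch also assumes that poverty is the only confounder and that it is measured perfectly, which real data would not supply, so the standardized figures serve only to make the confounding visible and are not offered as a workable adjustment.

```python
# Hypothetical states: B's students outscore A's within each group,
# but B has a much larger share of students in poverty.
STATES = {
    "A": {"poverty_share": 0.10, "mean_nonpoor": 270.0, "mean_poor": 240.0},
    "B": {"poverty_share": 0.40, "mean_nonpoor": 275.0, "mean_poor": 245.0},
}


def weighted_mean(poverty_share, mean_nonpoor, mean_poor):
    """Overall mean score for a given mix of poor and non-poor students."""
    return (1 - poverty_share) * mean_nonpoor + poverty_share * mean_poor


def unadjusted_means(states):
    """Mean score of each state using its own poverty share."""
    return {name: weighted_mean(s["poverty_share"], s["mean_nonpoor"],
                                s["mean_poor"])
            for name, s in states.items()}


def standardized_means(states, reference_poverty_share=0.25):
    """Mean score each state would show with a common poverty share."""
    return {name: weighted_mean(reference_poverty_share, s["mean_nonpoor"],
                                s["mean_poor"])
            for name, s in states.items()}


if __name__ == "__main__":
    print("unadjusted:  ", unadjusted_means(STATES))
    print("standardized:", standardized_means(STATES))
```

Here the state with the lower unadjusted mean has the higher mean within each group, so a simple ranking would credit the wrong school system even though nothing about instruction differs from what the subgroup means show.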

Nor would a more refined use of NAEP data provide more
meaningful information about school effectiveness. For example, why not
adjust states' scores to control for confounding differences in demographics
and other factors? One could, of course, but the results are unlikely to be
meaningful. NAEP, like most large-scale assessment programs, does not
provide enough of the needed information. What would be needed is
sufficient data to estimate how students in one state would score if they were
subjected to the educational policies of another state but remained the same
in all other respects, and NAEP does not even approximate that level of
detail. It includes, for example, no longitudinal data on educational
experiences or outcomes and includes only weak information about
non-educational variables that are known to influence achievement. Moreover,
even if NAEP could reliably identify states with better educational programs,
it cannot tell us which aspects of those programs matter.^

^ Experiences with the use of adjusted scores at the local and state level have
generally been discouraging. Adjusted rankings are often found to be highly inconsistent,
varying markedly from year to year and across grade levels, subject areas, and even the
statistical methods used to adjust the scores (e.g., Frechtling, 1^; Guskey and Kifer, 1990;

CONCLUSION

The National Assessment of Educational Progress remains best suited 
for its original purpose: monitoring the performance of America's youth. In
that function, unlike other roles that have been proposed for it in recent
years, NAEP has no close substitute. It has been valuable as a tool for
monitoring achievement in the past, and it can remain so if we maintain it
properly. To use NAEP for other roles, however, raises difficult technical
issues and, in some cases, could jeopardize NAEP's effectiveness in meeting
its primary purpose. 



Kippel, 1981; Mandeville and Anderson, 1987; Matthews, Soder, Ramey, and Sanders, 1^1;
Rowan and Denk, 1983). The causes of these inconsistencies have not been fully explored,
but the inadequacy of background data and the lack of longitudinal data are likely to be
important factors.